Evaluating Natural Language Generated Database Records
Abstract
Introduction

Project MURASAKI

The purpose of Project MURASAKI is to develop a foreign language text understanding system that will demonstrate the extensibility of message understanding technology.¹ In its current design, Project MURASAKI will process Spanish and Japanese text and extract information in order to generate records in Spanish and Japanese natural language databases, respectively. The fields within these database records will contain a natural language phrase or expression in the respective language.

¹ Thus, it is not to be confused with a message understanding project; it is rather a multi-paragraph (i.e., text) understanding project [5].

The domain of Project MURASAKI is the disease AIDS. The associated software system will include a general domain model of AIDS in the knowledge base. Within this model, there will be five subdomains:

• incidence reports: records the occurrence of AIDS and HIV infection in countries and regions, among various populations;
• testing policies: covers measures to test groups for AIDS;
• campaigns: describes measures adopted to combat AIDS;
• new technologies: lists new equipment and material used in detecting and preventing AIDS; and
• AIDS research: details the various vaccines and treatments that are being developed to prevent AIDS.

The subdomains of incidence reports, testing policies and campaigns are found in the Spanish text, while the topics of incidence reports, new technologies and AIDS research are covered in the Japanese text.

Project MURASAKI will demonstrate a sufficient level of full text understanding to be able to identify the existence of factual information within a given Spanish or Japanese text that belongs within a particular Spanish or Japanese language database. Then, it will determine what information in that text constitutes a single record in the selected database.
The balance of this paper will focus on the evaluation technique: why it was chosen, some basic assumptions underlying it, and its design and application. To illustrate various technical points of this technique, examples will be given using text excerpted from the Spanish AIDS corpus and its associated (generated) Spanish database records. Appendix A contains a sample Spanish AIDS text (Text #124) and its English translation.² Appendix B contains a record from the Incidence Reporting database that was generated from Text #124. Similarly, Appendix C contains a record from the Testing Policies database that was also generated from Text #124.

² In the MURASAKI text corpus, no English translations exist for any of the texts.

The Need for a Black Box

Given the overall design of this foreign language text understanding program, there arose the need for developing a general purpose evaluation technique [1]. This technique would compare the actual, computer generated output of one such system to the expected, human generated output of another. That is to say, given some sample piece of (foreign language) text as input, some predefined system output (namely, for Project MURASAKI, the generation of a finite number of database records) could be manually generated so that the correct performance of the computer system could be determined. Given this type of "correct" output, it would therefore be possible to measure the performance of an automated system based on this type of well-defined input/output pairs. It was precisely this type of rationale that led to the development of a black box evaluation: evaluation primarily focused on what a system produces externally rather than what a system does internally.
In direct contrast to this type of evaluation is glass box evaluation: "looking inside the system and finding ways of measuring how well it does something, rather than whether or not it does it" [5]. With the development of the MURASAKI evaluation technique comes the notion of two types of measures: a quantitative measure and a qualitative measure. The quantitative measure determines the number of correct (and/or incorrect) records that have been generated in any one database, while the qualitative measure evaluates the "correctness" of any database record field.

Background

Some Assumptions

Given the overall design of Project MURASAKI, there are a few assumptions, or rather, some groundwork that needs to be laid, in order to proceed in the development of this evaluation technique. These assumptions are explained as follows:

• Given the nature of the AIDS text corpus, any one text could possibly generate one or more records in one or more databases. This fact is loosely referred to as domain complexity. (Furthermore, for any record, all fields may not be filled.)
• Given the structure of the AIDS domain model, it is just as easy (or hard) to distinguish one subdomain from another. That is, each database is as likely to have a record generated in it as another. This hypothesis is known as subdomain differentiation.
• Upon the determination of what the expected output of Project MURASAKI should resemble, a correct record (in any database) is uniquely identified by the contents of its key fields plus the contents of one or more non-key fields. This statement constitutes the definition of a correct record.³

Generated Output: What Could Go Wrong?
After a thorough analysis of the system flow for Project MURASAKI, and given a typical AIDS text as system input, the following list represents all possible undesirable situations that could arise:

1. Generate one or more records in the wrong database.
2. Not generate one or more records in the correct database.
3. Generate too many records in the correct database, i.e., over-generate.
4. Generate too few records in the correct database, i.e., under-generate.
5. Generate too many fields in the correct record.
6. Generate too few fields in the correct record.
7. Generate the wrong answer in the fields.

Situations 1 and 2 illustrate what could go wrong at the database level, while situations 3 and 4 depict possible problems arising at the database record level. The remaining situations (namely 5, 6 and 7) show what could happen at the database record field level. However, the more crucial way of viewing these problems is not so much where (i.e., at what level) these events occur, but rather how these problems can be detected and thus measured for evaluation purposes. It is with this motivation that the following categorization was derived: a quantitative measure could be designed to account for the problems that could arise at both the database and database record levels, while a qualitative measure could comparably be designed for evaluation at the database record field level.

³ All appropriate information should be extracted from the text and placed in the correct database. A change in any of the key fields will result in the generation of a new record. For example, if data from a different time period is presented in the text, a key field change is required, and a new record is generated. If data from a new region is presented, a new record is generated. Examples of key and non-key fields are found in Appendices B and C. Key fields, which are found in the thick, darkened boxes, are the same throughout each database.
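The grouping of the seven situations by level and by measure can be written down as a small lookup table. The sketch below is illustrative only: the names `Level`, `Measure`, `SITUATIONS` and `measure_for` are hypothetical and are not part of the MURASAKI system.

```python
from enum import Enum


class Level(Enum):
    DATABASE = "database"
    RECORD = "database record"
    FIELD = "database record field"


class Measure(Enum):
    QUANTITATIVE = "quantitative"
    QUALITATIVE = "qualitative"


# Situation number -> (short description, level at which it occurs),
# following the numbered list in the text.
SITUATIONS = {
    1: ("records generated in the wrong database", Level.DATABASE),
    2: ("records not generated in the correct database", Level.DATABASE),
    3: ("too many records in the correct database", Level.RECORD),
    4: ("too few records in the correct database", Level.RECORD),
    5: ("too many fields in the correct record", Level.FIELD),
    6: ("too few fields in the correct record", Level.FIELD),
    7: ("wrong answer in the fields", Level.FIELD),
}


def measure_for(situation: int) -> Measure:
    """Return the measure that accounts for the given situation number."""
    _, level = SITUATIONS[situation]
    # Field-level problems are judged qualitatively; database-level and
    # record-level problems are counted quantitatively.
    if level is Level.FIELD:
        return Measure.QUALITATIVE
    return Measure.QUANTITATIVE


for number, (description, level) in SITUATIONS.items():
    print(number, description, "|", level.value, "|", measure_for(number).value)
```

Keeping the level and the measure as separate lookups mirrors the point made in the text: where a problem occurs and how it is measured are related but distinct questions.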
In the next section, two examples are given depicting how the quantitative measure accounts for problems arising at the first two levels. (Note: 'rec.' is the abbreviation for record in these examples.)

A Quantitative Measure

Background

A scoring function is used for the quantitative measure to calculate an aggregate score for the number of correct records (as defined previously) generated ('gen.' in the following examples) for a given MURASAKI text. This scoring function assigns one point for the generation of a correct record ('cor.') and p points, where 0 < p < 1, for the generation of an incorrect record ('inc.').

Some Questions

Given the two examples in Table 1, the following questions come to mind:

• What should be the value of p? 1/2? 1/3? 1/4? Does bounding it between 0 and 1 imply any linguistic restrictions on focus or coverage of the text? Or rather, should these bounds become parameters of this measure?

Table 1

Ex. #1:      DB #1      DB #2      DB #3      TOTAL
Text 124     3 rec.     2 rec.     1 rec.     6
what if,     2 gen.     2 gen.     2 gen.
where        1 cor.     2 cor.     2 inc.
             1 inc.
             (1 inc.)
             1-2p       2          -2p        3-4p of 6

Ex. #2:      DB #1      DB #2      DB #3      TOTAL
Text xxx     3 rec.     1 rec.     0 rec.     4
what if,     4 gen.     0 gen.     1 gen.
where        3 cor.     1 inc.     1 inc.
             1 inc.
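The arithmetic of Example 1 can be checked with a short sketch. It assumes, following score entries such as 1-2p in Table 1, that each incorrect record subtracts p from the score and that an expected record that was not generated also counts as incorrect. The function name `score` and the value p = 1/2 are illustrative choices only; the text leaves the value of p open.

```python
from fractions import Fraction


def score(correct: int, incorrect: int, p: Fraction) -> Fraction:
    """One point per correct record, minus p per incorrect record (assumed sign)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return correct - p * incorrect


# Example 1 (Text 124): (correct, incorrect) counts per database.
# DB #1 counts two incorrect records: one generated, one expected but missing.
example_1 = {"DB #1": (1, 2), "DB #2": (2, 0), "DB #3": (0, 2)}

p = Fraction(1, 2)  # illustrative value, not prescribed by the text

per_db = {db: score(c, i, p) for db, (c, i) in example_1.items()}
total = sum(per_db.values())

for db, value in per_db.items():
    print(db, value)
print("TOTAL", total, "which equals 3 - 4p =", 3 - 4 * p)
```

Using exact fractions rather than floating point keeps the per-database scores directly comparable with the symbolic entries 1-2p, 2 and -2p.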